New packages to install for this practicum:
install.packages("devtools")
library(devtools)
install_github("wilkelab/cowplot")Good visualization is critical to present research findings in an intuitive and appealing way, but it can also enhance our ability to understand data as we analyze it and uncover unexpected patterns. In this practicum, we will learn more about the ggplot2 package, which has greatly simplified the production of high quality graphics in R. The official ggplot2 documentation is helpful to give an idea of how much this package can do. You can also read more in this overview and/or the data visualization chapter of the “R for Data Science” book. Samples of plots that can be made with ggplot2 with accompanying code are available at sites such as UBC Stat shiny server. We will use (1) iris, a classic R built-in dataset that contains measures of sepal length and width, and petal length and width for 50 flowers from each of 3 species of iris (setosa, versicolor, virginica), and (2) the NHANES dataset we’ve used before.
library(dplyr)
library(ggplot2)
nhanes <- read.csv(url("https://raw.githubusercontent.com/HimesGroup/BMIN503/master/DataFiles/NHANES_2007to2008.csv"), header=TRUE)
nhanes <- rename(nhanes, id=SEQN, gender=RIAGENDR, age=RIDAGEYR, race=RIDRETH1, education=DMDEDUC2, income=INDHHIN2, health.provider=HUQ040, wheezing=RDQ070, asthma=MCQ010, voc=WTSVOC2Y, fvc=SPXNFVC, fev1=SPXNFEV1) %>%
mutate(gender=factor(gender, levels=c(1, 2), labels=c("male", "female"))) %>%
mutate(race=factor(race, levels=c(3, 1, 2, 4, 5), labels=c("white", "MexicanAmerican", "Hispanic", "black", "other"))) %>%
mutate(asthma=ifelse(asthma %in% c(7,9), NA, ifelse(asthma==2, 0, 1))) %>%
filter(!is.na(asthma))
str(nhanes)## 'data.frame': 9659 obs. of 12 variables:
## $ id : int 41475 41476 41477 41478 41479 41480 41481 41482 41483 41485 ...
## $ gender : Factor w/ 2 levels "male","female": 2 2 1 2 1 1 1 1 1 2 ...
## $ age : int 62 6 71 1 52 6 21 64 66 30 ...
## $ race : Factor w/ 5 levels "white","MexicanAmerican",..: 5 5 1 1 2 2 4 2 4 3 ...
## $ education : int 3 NA 3 NA 1 NA 3 2 4 2 ...
## $ income : int 6 15 5 3 8 7 6 15 5 6 ...
## $ health.provider: int 1 2 1 2 2 1 1 1 2 NA ...
## $ wheezing : int 1 2 2 2 2 2 2 2 2 2 ...
## $ asthma : num 1 0 0 0 0 0 0 0 0 0 ...
## $ voc : num 123988 NA 20915 NA NA ...
## $ fvc : int 2473 1742 NA NA 3545 1836 4928 3907 3170 NA ...
## $ fev1 : int 2025 1551 NA NA 2957 1691 4541 2920 2355 NA ...
str(iris)## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
nhanes %>%
mutate(adult.status = factor(age>=18, labels=c("child", "adult"))) %>%
filter(adult.status=="child") %>%
ggplot(aes(x=age, y=fev1)) +
geom_point() +
geom_smooth(method = "lm", color="red")We’ll look at more options related to ggplot by adding to the layers created before. More options related to setting background, axis, title, and legend options are available in theme. The default settings for ggplot will appropriately select color type (although not necessary color blind or black-and-white printer friendly).
ggplot(data=iris, aes(Sepal.Length, Petal.Length, color=Petal.Width)) +
geom_point() +
labs(title="Iris Characteristics") +
labs(x="Sepal Length", y="Petal Length") + #Adds a layer with labels
xlim(c(4, 8)) + #Changes x axis limits
ylim(c(1, 8)) #Changes y axis limitsChange color scale.
ggplot(data=iris, aes(Sepal.Length, Petal.Length, color=Petal.Width)) +
geom_point() +
scale_color_gradient(low="#fee6ce", high="#e6550d") +
labs(title="Iris Characteristics") +
labs(x="Sepal Length", y="Petal Length") + #Adds a layer with labels
xlim(c(4, 8)) + #Changes x axis limits
ylim(c(1, 8)) + #Changes y axis limits
theme_bw() #Changes theme to black and whiteThis time we can view up to four dimensions of our data by setting color and size, in addition to x and y variables.
ggplot(data=iris, aes(Sepal.Length, Petal.Length, color=Species, size=Petal.Width)) +
geom_point() +
labs(title="Iris Characteristics") +
labs(x="Sepal Length", y="Petal Length") + #Adds a layer with labels
xlim(c(4, 8)) + #Changes x axis limits
ylim(c(1, 8)) + #Changes y axis limits
theme_bw() #Changes theme to black and whiteWhen creating barplots, use geom_bar with default option stat=bin to get counts in each category. To split panels according to another categorical variable facet_grid can be used. More color options that work for all chart types are described here.
ggplot(data=nhanes, aes(x=race, fill=factor(asthma))) +
geom_bar(position="dodge")ggplot(data=nhanes, aes(x=race, fill=factor(asthma))) +
geom_bar(position="dodge") +
facet_grid(. ~gender) + #Split by another variable
theme(axis.text.x=element_text(angle=45, hjust=1))ggplot(data=nhanes, aes(x=gender, fill=factor(asthma))) +
geom_bar() +
scale_fill_brewer(palette="Set1") #ColorBrewer palettesggplot(data=iris, aes(Species, Sepal.Length)) +
geom_boxplot() ggplot(data=iris, aes(Species, Sepal.Length)) +
geom_boxplot() +
geom_jitter() #Add a second layer of jittered data pointsggplot(data=iris, aes(Species, Sepal.Length)) +
geom_boxplot(outlier.colour="red", outlier.size=4) #Making outliers more noticeableggplot(data=iris, aes(Species, Sepal.Length)) +
geom_violin() #violin plotsggplot(data=iris, aes(Species, Sepal.Length)) +
geom_violin(fill="lightblue", draw_quantiles = c(0.25, 0.5, 0.75)) +
geom_jitter(height=0, width=0.1)ggplot(data=iris, aes(Species, Sepal.Length)) +
geom_boxplot(aes(fill=Species)) #Coloring according to speciesggplot(data=nhanes, aes(gender, voc)) +
geom_boxplot(color="darkgreen", fill="darkgreen", alpha=0.5)ggplot(data=nhanes, aes(gender, voc)) +
geom_boxplot(aes(fill=factor(asthma))) #Add another variable with fillggplot(data=nhanes, aes(age)) +
geom_histogram(breaks=seq(0, 80, 1), color="blue", fill="blue", alpha=0.7) #Density plot
ggplot(data=nhanes, aes(age)) +
geom_histogram(aes(y=..density..), breaks=seq(0, 80, 1), color="blue", fill="blue", alpha=0.7) +
geom_density(color="red")When many points are part of a scatter plot, it can be difficult to see patterns that represent major trends. The geom_smooth command (similar to geom_stat) will plot smooth conditional means. The smoothing function can be specified with the method option as “lm”, “glm”, “gam”, “loess”, “nls”.
ggplot(nhanes, aes(x=age, y=fev1)) +
geom_point(color="grey") +
geom_smooth(color="red") #Default option is method="auto" which selects loess for <1000 points and gam with formula y~s(x, bs="cs") otherwiseggplot(nhanes, aes(x=age, y=fev1)) +
geom_point(color="grey") +
geom_smooth(color="red") +
geom_smooth(color="black", method="lm") +
geom_smooth(color="green", method="loess")Additionally, a formula can be given for use in the smoothing function.
x <- 1:10
y <- jitter(x^2)
DF <- data.frame(x, y)
ggplot(DF, aes(x=x, y=y)) +
geom_point() +
geom_smooth(method='lm', aes(color = 'linear'), se=FALSE) +
geom_smooth(method='lm', formula = y ~ poly(x, 2), aes(color = 'polynomial'), se=FALSE) +
geom_smooth(method='nls',
formula = y ~ a*log(x) + b,
method.args = list(start = c(a=1, b=1)),
aes(color = 'logarithmic'),
se=FALSE) +
geom_smooth(method='nls',
formula = y ~ a*exp(b*x),
method.args = list(start = c(a=1, b=1)),
aes(color = 'Exponential'),
se=FALSE)When many points are in a 2D scatter plot, it can similarly be difficult to tell where the majority of points are. Density plots can help us interpret data relationships when there is overplotting.
ggplot(nhanes, aes(x=fvc/1000, y=fev1/1000)) +
geom_point(color="grey") +
stat_density2d(aes(fill=..level.., alpha=..level..), geom="polygon", color="black") +
scale_fill_continuous() +
guides(alpha="none")If the contours are left out, then stat_density2d can be used to produce a raster of the data.
ggplot(nhanes, aes(x=fvc/1000, y=fev1/1000)) +
stat_density2d(aes(fill=..density..), geom="raster", contour=FALSE) +
scale_fill_continuous(low="#238443", high="#ffffcc") +
guides(alpha="none")viridis color schemeThis color scheme is a personal favorite for use with heatmaps and other 2D plots that, as of 2018, is part of ggplot2.
ggplot(nhanes, aes(x=fvc/1000, y=fev1/1000)) +
stat_density2d(aes(fill=..density..), geom="raster", contour=FALSE) +
scale_fill_viridis_c() +
guides(alpha="none")To save graphical output to a file in R, one opens a graphics device to redirect output of plots to them, and when plots are completed, one closes the device to save the final version. To see all available graphical devices, type ?Devices. The default device in R is usually an interactive window, like the one we see in RStudio. Here are two examples using common formats.
options("device") #To see the current graphics device## $device
## function (width = 7, height = 7, ...)
## {
## grDevices::pdf(NULL, width, height, ...)
## }
## <bytecode: 0x7faf09457d78>
## <environment: namespace:knitr>
fig <- ggplot(data=iris, aes(Species, Sepal.Length)) +
geom_boxplot(color="darkblue", fill="blue")
#Save as PDF
pdf("Ex1.pdf", width=6, height=5) #Units are inches
fig
dev.off()## quartz_off_screen
## 2
#Save as TIFF
tiff("Ex1.tiff", width=480, height=480, units="px")
fig
dev.off()## quartz_off_screen
## 2
ggplot2 offers a simplified function to save graphics output: ggsave.
figggsave("Ex2.pdf") #Default plot to save is last plot
ggsave("Ex2.tiff", plot=fig)The theme used with ggplot2 can be customized to suit your personal taste. One researcher created an R package called cowplot as a ggplot2 add-on to easily create publication-ready graphics.
library(cowplot)
save_plot("Ex3.pdf", fig, base_aspect_ratio=1.3) #Save using the default format of cowplot